Abstract

In two-way contingency tables we sometimes find that frequencies along the
diagonal cells are relatively larger (or smaller) compared to off-diagonal cells,
particularly in square tables with common categories for the rows and the columns.
In this case the quasi-independence model with an additional parameter for each
of the diagonal cells is usually fitted to the data. A simpler model than the
quasi-independence model is to assume a common additional parameter for all the
diagonal cells. We consider testing the goodness of fit of the common diagonal effect
by the Markov chain Monte Carlo (MCMC) method. We derive an explicit form of a
Markov basis for performing the conditional test of the common diagonal effect.
Once a Markov basis is given, the MCMC procedure can be easily implemented by
techniques of algebraic statistics. We illustrate the procedure with some real data
sets.



1 Introduction

In this paper we discuss a conditional test of a common effect for diagonal cells in
two-way contingency tables. Modeling diagonal effects arises mainly in analyzing contingency
tables with common categories for the rows and the columns, although our approach is
applicable to general rectangular tables. Many models have been proposed for square
contingency tables. Tomizawa [2006] gives a comprehensive review of models for square
contingency tables. Gao and Kuriki [2006] discuss testing marginal homogeneity against
ordered alternatives.

Goodness of fit tests of these models are usually performed based on the large sample
approximation to the null distribution of test statistics. However, when a model is
expressed in a log-linear form of the cell probabilities, a conditional testing procedure
(e.g. Fisher's exact test for 2 x 2 contingency tables) can be used. Optimality of conditional
tests is a well-known classical fact (Lehmann and Romano [2005, Chapter 4]). Also, the large
sample approximation may be poor when expected cell frequencies are small (Haberman
[1988]).



Sturmfels [1996] and Diaconis and Sturmfels [1998] developed an algebraic algorithm
for sampling from conditional distributions for a statistical model of discrete exponential
families. This algorithm is applied to conditional tests through the notion of Markov
bases. In the Markov chain Monte Carlo approach for testing statistical fitting of the
given model, a Markov basis is a set of moves connecting all contingency tables satisfying
the given margins. Since then many researchers have extensively studied the structure of
Markov bases for models in computational algebraic statistics (e.g. Hosten and Sullivant
[2002], Dobra [2003], Dobra and Sullivant [2004], Geiger et al. [2006], Hara et al. [2007a]).



It is well known that for two-way contingency tables with fixed row sums and
column sums the set of square-free moves of degree two of the form

          j    j'
    i    +1   -1
    i'   -1   +1

constitutes a Markov basis. However, when we impose an additional constraint that the
sum of cell frequencies of a subtable S is also fixed, then these moves do not necessarily
form a Markov basis. In Hara et al. [2007b] we gave a necessary and sufficient condition
on S so that the set of square-free moves of degree two forms a Markov basis. We called
this problem a subtable sum problem. For the common diagonal effect model defined below
in (2), S is the set of diagonal cells. We call this problem a diagonal sum problem. By
the result of Hara et al. [2007b] we know that the set of square-free moves of degree two
does not form a Markov basis for the diagonal sum problem. In this paper we give an
explicit form of a Markov basis for the two-way diagonal sum problem. The Markov basis
contains moves of degree three and four.

When the sum of cell frequencies of a subtable S is fixed to zero, then the frequency of
each cell of S has to be zero and the subtable sum problem reduces to the structural zero
case. Contingency tables with structural zero cells are called incomplete contingency
tables (Bishop et al. [1975, Chapter 5]). From the viewpoint of Markov bases, the subtable
sum problem is a generalization of the problem concerning structural zeros. Properties of
Markov bases for incomplete tables are studied in Aoki and Takemura [2005], Huber et al.
[2006], Rapallo [2006].



This paper is organized as follows. In Section 2 we introduce the common diagonal
effect model as a submodel of the quasi-independence model. In Section 3 we summarize
some preliminary facts on algebraic statistics and Markov bases. Section 4 shows a Markov
basis for contingency tables with fixed row sums, column sums, and the sum of diagonal
cells. Numerical examples with some real data sets are given in Section 5. We conclude
this paper with some remarks in Section 6.

2 Quasi-independence model and the common diagonal effect model for two-way contingency tables

Consider an R x C two-way contingency table $x = \{x_{ij}\}$, $i = 1, \dots, R$, $j = 1, \dots, C$,
where frequencies along the diagonal cells are relatively larger compared to off-diagonal
cells. Table 1 (Agresti [2002, Section 10.5]) shows agreement between two pathologists in
their diagnoses of carcinoma. We naturally see the tendency that the two pathologists agree



Table 1: Diagnoses of carcinoma

           1    2    3    4
      1   22    2    2    0
      2    5    7   14    0
      3    0    2   36    0
      4    0    1   17   10




in their diagnoses. Usually the quasi-independence model is fitted to this type of data.
In the quasi-independence model, the cell probabilities $\{p_{ij}\}$ are modeled as

$$\log p_{ij} = \mu + \alpha_i + \beta_j + \gamma_i \delta_{ij}, \qquad (1)$$

where $\delta_{ij}$ is Kronecker's delta. In (1) each diagonal cell $(i, i)$, $i = 1, \dots, \min(R, C)$, has
its own free parameter $\gamma_i$. This implies that in the maximum likelihood estimation each
diagonal cell is perfectly fitted:

$$\hat{p}_{ii} = \frac{x_{ii}}{n},$$

where $n = \sum_{i=1}^{R} \sum_{j=1}^{C} x_{ij}$ is the total frequency.

As a simpler submodel of the quasi-independence model we consider the null hypothesis

$$H: \gamma_i = \gamma, \quad i = 1, \dots, \min(R, C), \qquad (2)$$

in the quasi-independence model. We call this model a common diagonal effect model
and abbreviate it as CDEM hereafter. In CDEM the tendency of the diagonal cells is
expressed by a single parameter, rather than perfect fits to diagonal cells. We present some
numerical examples of testing CDEM against the quasi-independence model in Section 5.
Both the quasi-independence model and CDEM are usually applied to square contingency
tables, i.e., $R = C$. As shown in Section 4, however, Markov bases of CDEM do not
essentially depend on the assumption $R = C$. Therefore, in this article, we consider more
general cases, i.e., $R \neq C$.

Under CDEM the sufficient statistic consists of the row sums, column sums and the
sum of the diagonal frequencies:

$$x_{i+} = \sum_{j=1}^{C} x_{ij}, \quad i = 1, \dots, R, \qquad x_{+j} = \sum_{i=1}^{R} x_{ij}, \quad j = 1, \dots, C, \qquad x_S = \sum_{i=1}^{\min(R,C)} x_{ii}.$$

We write the sufficient statistic as a column vector

$$t = (x_{1+}, \dots, x_{R+}, x_{+1}, \dots, x_{+C}, x_S)'.$$

We also order the elements of $x$ lexicographically and regard $x$ as a column vector. Then
with an appropriate matrix $A_S$ consisting of 0's and 1's we can write

$$t = A_S x.$$
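
The relation $t = A_S x$ can be made concrete with a small sketch. The function names `design_matrix` and `sufficient_statistic` are our own illustrative choices, and the data are the Table 1 counts with the blank cells of the printed table read as zeros (an assumption on our part).

```python
def design_matrix(R, C):
    """Build A_S: R row-sum rows, C column-sum rows, then one diagonal-sum row.

    Columns index the cells (a, b) in lexicographic order.
    """
    cells = [(a, b) for a in range(R) for b in range(C)]
    rows = [[1 if a == i else 0 for (a, b) in cells] for i in range(R)]
    rows += [[1 if b == j else 0 for (a, b) in cells] for j in range(C)]
    rows.append([1 if a == b else 0 for (a, b) in cells])
    return rows


def sufficient_statistic(table):
    """Return t = A_S x for a table given as a list of rows."""
    R, C = len(table), len(table[0])
    x = [table[i][j] for i in range(R) for j in range(C)]  # lexicographic vector
    return [sum(a * v for a, v in zip(row, x)) for row in design_matrix(R, C)]


# Table 1 (blank cells assumed to be zero counts).
table1 = [[22, 2, 2, 0],
          [5, 7, 14, 0],
          [0, 2, 36, 0],
          [0, 1, 17, 10]]

t = sufficient_statistic(table1)
print(t)  # row sums, then column sums, then the diagonal sum
```

For a 4 x 4 table $A_S$ has 9 rows and 16 columns; its rows are linearly dependent because the row sums and the column sums both add up to $n$.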



3 Preliminaries on Markov bases 



In this section we summarize some preliminary definitions and notations on Markov bases
(Diaconis and Sturmfels [1998]). By now Markov bases and their uses are discussed in
many papers. See Aoki and Takemura [2005] for example.



The set of contingency tables $x$ sharing the same sufficient statistic

$$\mathcal{F}_t = \{x \ge 0 \mid t = A_S x\}$$

is called a $t$-fiber. An integer table $z$ is a move for $A_S$ if $0 = A_S z$. By adding a move $z$ to
$x \in \mathcal{F}_t$, we remain in the same fiber $\mathcal{F}_t$ provided that $x + z$ does not contain a negative
cell. A finite set of moves $\mathcal{B} = \{z_1, \dots, z_L\}$ is a Markov basis, if for every $t$, $\mathcal{F}_t$ becomes
connected by $\mathcal{B}$, i.e., we can move all over $\mathcal{F}_t$ by adding or subtracting the moves from $\mathcal{B}$
to contingency tables in $\mathcal{F}_t$.

If $z$ is a move then $-z$ is a move as well. For convenience we add $-z$ to $\mathcal{B}$ whenever
$z \in \mathcal{B}$ and only consider sign-invariant Markov bases in this paper. A Markov basis $\mathcal{B}$ is
minimal, if every proper sign-invariant subset of $\mathcal{B}$ is no longer a Markov basis. A move
$z$ is called indispensable if $z$ has to belong to every Markov basis. Otherwise $z$ is called
dispensable.

A move $z$ has positive elements and negative elements. Separating these elements
we write $z = z^+ - z^-$, where $(z^+)_{ij} = \max(z_{ij}, 0)$ is the positive part and
$(z^-)_{ij} = \max(-z_{ij}, 0)$ is the negative part of $z$. $z^+$ and $z^-$ belong to the same fiber.
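
The definitions of this section translate directly into a few lines of code. The following is a minimal sketch for the diagonal sum problem; the helper names (`margins`, `is_move`, `parts`, `is_applicable`) and the example tables are hypothetical and only serve as illustration.

```python
def margins(z):
    """Row sums, column sums and diagonal sum of a table (list of rows)."""
    R, C = len(z), len(z[0])
    row = [sum(z[i]) for i in range(R)]
    col = [sum(z[i][j] for i in range(R)) for j in range(C)]
    diag = sum(z[i][i] for i in range(min(R, C)))
    return row, col, diag


def is_move(z):
    """z is a move for A_S iff all of its margins vanish."""
    row, col, diag = margins(z)
    return not any(row) and not any(col) and diag == 0


def parts(z):
    """Positive part z^+ and negative part z^- with z = z^+ - z^-."""
    pos = [[max(v, 0) for v in r] for r in z]
    neg = [[max(-v, 0) for v in r] for r in z]
    return pos, neg


def is_applicable(z, x):
    """x + z has no negative cell, which is equivalent to z^- <= x cellwise."""
    _, neg = parts(z)
    return all(n <= v for rn, rx in zip(neg, x) for n, v in zip(rn, rx))


# Hypothetical 3 x 3 example: a degree 3 move with zeros on the diagonal.
z = [[0, 1, -1],
     [-1, 0, 1],
     [1, -1, 0]]
z_pos, z_neg = parts(z)
x = [[1, 0, 2],
     [3, 1, 0],
     [0, 2, 1]]
print(is_move(z), margins(z_pos) == margins(z_neg), is_applicable(z, x))
```

The check `margins(z_pos) == margins(z_neg)` is the statement that $z^+$ and $z^-$ lie in the same fiber.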



We next discuss the notion of distance reduction by a move (Aoki and Takemura
[2003], Takemura and Aoki [2005], Hara et al. [2007b]). When $x + z$ does not contain
a negative cell, we say that $z$ is applicable to $x$. $z$ is applicable to $x$ if and only if
$z^- \le x$ (inequality for each element). Given two contingency tables $x$, $y$ let
$|x - y| = \sum_{i,j} |x_{ij} - y_{ij}|$ denote the $L_1$-distance between $x$ and $y$. For $x$ and $y$ in the same fiber,
we say that $z$ reduces their distance if $z$ or $-z$ is applicable to $x$ or $y$ and the distance
$|x - y|$ is reduced by the application, e.g. $|x + z - y| < |x - y|$. A sufficient condition
for $z$ to reduce the distance between $x$ and $y$ is that at least one of the following four
conditions holds:

    (i)   $z^+ \le x$, $\min(z^-, y) \ne 0$,     (ii)  $z^+ \le y$, $\min(z^-, x) \ne 0$,
    (iii) $z^- \le x$, $\min(z^+, y) \ne 0$,     (iv)  $z^- \le y$, $\min(z^+, x) \ne 0$,

where "min" denotes element-wise minimum. We can also think of reducing the distance
by a sequence of moves from $\mathcal{B}$. Clearly a finite set of moves $\mathcal{B}$ is a Markov basis if for
every two tables $x$, $y$ from every fiber, we can reduce the distance $|x - y|$ by a move $z$
or a sequence of moves $z_1, \dots, z_k$ from $\mathcal{B}$. We use the argument of distance reduction for
proving Theorem 1 in the next section.

We end this section with a known fact for the structural zero problem. In order to
state it we introduce two types of moves. In these moves, the non-zero elements are
located in the complement $S^c$ of $S$, i.e., they are in the off-diagonal cells.

• Type I (basic moves in $S^c$ for $\max(R, C) \ge 4$):

          j    j'
    i    +1   -1
    i'   -1   +1

where $i, i', j, j'$ are all distinct.

• Type II (indispensable moves of degree 3 in $S^c$ for $\min(R, C) \ge 3$):

           i    i'   i''
    i      0   +1   -1
    i'    -1    0   +1
    i''   +1   -1    0

where three zeros are on the diagonal.



Lemma 1 (Aoki and Takemura [2005, Section 5]). Moves of Type I and II form a
minimal Markov basis for the structural zero problem along the diagonal, i.e., $x_{ii} = 0$,
$i = 1, \dots, \min(R, C)$.



4 A Markov basis for the common diagonal effect 
model 

In order to describe a Markov basis for the diagonal sum problem, we introduce four
additional types of moves.

• Type III (dispensable moves of degree 3 for $\min(R, C) \ge 3$):

           i    i'   i''
    i     +1    0   -1
    i'     0   -1   +1
    i''   -1   +1    0

Note that given three distinct indices $i, i', i''$, there are three moves in the same fiber:

    +1   0  -1        +1  -1   0         0  -1  +1
     0  -1  +1        -1   0  +1        -1  +1   0
    -1  +1   0         0  +1  -1        +1   0  -1

Any two of these suffice for the connectivity of the fiber. Therefore we can choose
any two moves in this fiber for minimality of the Markov basis.

• Type IV (indispensable moves of degree 3 for $\max(R, C) \ge 4$):

           i    i'   j
    i     +1    0   -1
    i'     0   -1   +1
    j'    -1   +1    0

where $i, i', j, j'$ are all distinct. We note that Type IV is similar to Type III but
unlike the moves in Type III, the moves of Type IV are indispensable.

• Type V (indispensable moves of degree 4 which are non-square-free):

           j    j'   j''
    i     +1   +1   -2
    i'    -1   -1   +2

where $i = j$ and $i' = j'$, i.e., two cells are on the diagonal. Note that we also include
the transpose of this type as Type V moves.

• Type VI (square-free indispensable moves of degree 4 for $\max(R, C) \ge 4$):

           j    j'   j''  j'''
    i     +1   +1   -1   -1
    i'    -1   -1   +1   +1

where $i = j$ and $i' = j'$. Type VI includes the transpose of this type.
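
As a sanity check on the six move types, one can place a representative of each type in a 4 x 4 table and confirm that its row sums, column sums and diagonal sum vanish. The sketch below uses 0-based indices; the helper names and the particular index choices are ours, and the entries follow our reading of the layouts above.

```python
def put(R, C, entries):
    """Return an R x C integer table with the given {(row, col): value} entries."""
    z = [[0] * C for _ in range(R)]
    for (i, j), v in entries.items():
        z[i][j] += v
    return z


def is_move(z):
    """All row sums, column sums and the diagonal sum are zero."""
    R, C = len(z), len(z[0])
    rows_ok = all(sum(z[i]) == 0 for i in range(R))
    cols_ok = all(sum(z[i][j] for i in range(R)) == 0 for j in range(C))
    diag_ok = sum(z[i][i] for i in range(min(R, C))) == 0
    return rows_ok and cols_ok and diag_ok


def degree(z):
    """Degree of a move: the sum of its positive part."""
    return sum(v for r in z for v in r if v > 0)


R = C = 4
moves = {
    # Type I: i=0, i'=1, j=2, j'=3 (all distinct, off-diagonal).
    "I": put(R, C, {(0, 2): 1, (0, 3): -1, (1, 2): -1, (1, 3): 1}),
    # Type II: indices 0, 1, 2 with zeros on the diagonal.
    "II": put(R, C, {(0, 1): 1, (0, 2): -1, (1, 0): -1, (1, 2): 1,
                     (2, 0): 1, (2, 1): -1}),
    # Type III: indices 0, 1, 2.
    "III": put(R, C, {(0, 0): 1, (0, 2): -1, (1, 1): -1, (1, 2): 1,
                      (2, 0): -1, (2, 1): 1}),
    # Type IV: i=0, i'=1, j=2, j'=3.
    "IV": put(R, C, {(0, 0): 1, (0, 2): -1, (1, 1): -1, (1, 2): 1,
                     (3, 0): -1, (3, 1): 1}),
    # Type V: rows 0, 1 and columns 0, 1, 2.
    "V": put(R, C, {(0, 0): 1, (0, 1): 1, (0, 2): -2,
                    (1, 0): -1, (1, 1): -1, (1, 2): 2}),
    # Type VI: rows 0, 1 and columns 0, 1, 2, 3.
    "VI": put(R, C, {(0, 0): 1, (0, 1): 1, (0, 2): -1, (0, 3): -1,
                     (1, 0): -1, (1, 1): -1, (1, 2): 1, (1, 3): 1}),
}
for name, z in moves.items():
    print(name, is_move(z), degree(z))

# A basic move that touches two diagonal cells changes the diagonal sum.
diagonal_swap = put(R, C, {(0, 0): 1, (0, 1): -1, (1, 0): -1, (1, 1): 1})
print(is_move(diagonal_swap))
```

The last check illustrates why degree two moves alone cannot suffice here: a basic move that touches two diagonal cells preserves the row and column sums but shifts the diagonal sum by two.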

We now present the main theorem of this paper. 

Theorem 1. The above moves of Types I-VI form a Markov basis for the diagonal sum
problem with $\min(R, C) \ge 3$ and $\max(R, C) \ge 4$.
Proof. Let $X$, $Y$ be two tables in the same fiber. If

$$x_{ii} = y_{ii}, \quad \forall i = 1, \dots, \min(R, C),$$

then the problem reduces to the structural zero problem and we can use Lemma 1.
Therefore we only need to consider the difference

$$X - Y = Z = \{z_{ij}\},$$

where there exists at least one $i$ such that $z_{ii} \ne 0$. Note that in this case there are two
indices $i \ne i'$ such that

$$z_{ii} > 0, \quad z_{i'i'} < 0,$$

because the diagonal sum of $Z$ is zero. Without loss of generality we let $i = 1$, $i' = 2$. We
prove the theorem by exhausting various sign patterns of the differences in other cells and
confirming the distance reduction by the moves of Types I-VI. We distinguish two cases:
$z_{12} z_{21} \ge 0$ and $z_{12} z_{21} < 0$.

Case 1 ($z_{12} z_{21} \ge 0$): In this case without loss of generality assume that $z_{12} \ge 0$, $z_{21} \ge 0$.
Let 0+ denote a cell with non-negative value of $Z$ and let * denote a cell with arbitrary
value of $Z$. Then $Z$ looks like

    +    0+   *   ...
    0+   -    *   ...

Note that there has to be a negative cell on the first row and on the first column. Let
$z_{1j} < 0$, $z_{j'1} < 0$. Then $Z$ looks like

           1    2   ...   j   ...
    1      +    0+  ...   -   ...
    2      0+   -   ...   *   ...
    j'     -    *   ...   *   ...

If $j = j'$, we can apply a Type III move to reduce the $L_1$ distance. If $j \ne j'$, we can apply
a Type IV move to reduce the $L_1$ distance. This takes care of the case $z_{12} z_{21} \ge 0$.

Case 2 ($z_{12} z_{21} < 0$): Without loss of generality assume that $z_{12} > 0$, $z_{21} < 0$. Then $Z$
looks like

    +   +   *   ...
    -   -   *   ...
    *   *   *   ...

There has to be a negative cell on the first row and there has to be a positive cell on the
second row. Without loss of generality we can let $z_{13} < 0$ and at least one of $z_{23}, z_{24}$ is
positive. Therefore $Z$ looks like

    +   +   -   *   *  ...              +   +   -   *  ...
    -   -   *   +   *  ...      or      -   -   +   *  ...          (3)

These two cases are not mutually exclusive. We look at $Z$ as the left pattern whenever
possible. Namely, whenever we can find two different columns $j, j' \ge 3$, $j \ne j'$ such that
$z_{1j} z_{2j'} < 0$, then we consider $Z$ to be of the left pattern. We first take care of the case
that $Z$ does not look like the left pattern of (3), i.e., there are no $j, j' \ge 3$, $j \ne j'$, such
that $z_{1j} z_{2j'} < 0$.

Case 2-1 ($Z$ does not look like the left pattern of (3)): If there exists some $j \ge 4$ such
that $z_{1j} < 0$, then in view of $z_{23} > 0$ we have $z_{1j} z_{23} < 0$ and $Z$ looks like the left pattern
of (3). Therefore we can assume

$$z_{1j} \ge 0, \quad \forall j \ge 4.$$

Similarly

$$z_{2j} \le 0, \quad \forall j \ge 4,$$

and, writing 0- for a cell with non-positive value, $Z$ looks like

    +   +   -   0+  ...  0+
    -   -   +   0-  ...  0-

Because the first row and the second row sum to zero, we have

$$z_{13} \le -2, \quad z_{23} \ge 2.$$

However then we can apply a Type V move to reduce the $L_1$ distance.

Case 2-2 ($Z$ looks like the left pattern of (3)): Suppose that there exists some $i \ge 3$ such
that $z_{i3} > 0$. If $z_{33} > 0$, then $Z$ looks like

    +   +   -   *   *  ...
    -   -   *   +   *  ...
    *   *   +   *   *  ...

Then we can apply a Type III move involving

$$z_{12} > 0, \ z_{13} < 0, \ z_{22} < 0, \ z_{24} > 0, \ z_{33} > 0, \ z_{34}: \text{arbitrary}$$

and reduce the $L_1$ distance. On the other hand if $z_{i3} > 0$ for $i \ge 4$, then $Z$ looks like

    +   +   -   *   *  ...
    -   -   *   +   *  ...
    ...
    *   *   +   *   *  ...     (row i)

Then we can apply a Type IV move involving

$$z_{11} > 0, \ z_{13} < 0, \ z_{21} < 0, \ z_{24} > 0, \ z_{i3} > 0, \ z_{i4}: \text{arbitrary}$$

and reduce the $L_1$ distance. Therefore we only need to consider $Z$ which looks like

    +   +   -    *   *
    -   -   *    +   *
    *   *   0-   *   *
    ...
    *   *   0-   *   *

Similar consideration for the fourth column of $Z$ forces

    +   +   -    *    *
    -   -   *    +    *
    *   *   0-   0+   *
    ...
    *   *   0-   0+   *

However then because the third column and the fourth column sum to zero, we have
$z_{23} > 0$ and $z_{14} < 0$ and $Z$ looks like

    +   +   -    -    *
    -   -   +    +    *
    *   *   0-   0+   *
    ...
    *   *   0-   0+   *

Then we apply a Type VI move to reduce the $L_1$ distance.

Now we have exhausted all possible sign patterns of $Z$ and shown that the $L_1$ distance
can always be decreased by some move of Types I-VI. □

Since moves of Type I, II, IV, V and VI are indispensable, we have the following
corollary.

Corollary 1. A minimal Markov basis for the diagonal sum problem with $\min(R, C) \ge 3$
and $\max(R, C) \ge 4$ consists of moves of Types I, II, IV, V, VI and two moves of Type
III for each given triple of distinct indices $(i, i', i'')$.

5 Numerical examples



In this section we use the Markov basis obtained in the previous section to run
experiments via the MCMC method. In particular, we test the hypothesis of CDEM for a given data set.
Denote expected cell frequencies under the quasi-independence model and CDEM by

$$\hat{m}^{QI}_{ij} = n \hat{p}^{QI}_{ij}, \qquad \hat{m}^{S}_{ij} = n \hat{p}^{S}_{ij},$$

respectively. These expected cell frequencies can be computed via iterative proportional
fitting (IPF). IPF for the quasi-independence model is explained in Chapter 5 of
Bishop et al. [1975]. IPF for the common diagonal effect model is given as follows. The
superscript $k$ denotes the step count.

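1. Set $m^{S,k}_{ij} = m^{S,k-1}_{ij} \, x_{i+} / m^{S,k-1}_{i+}$ for all $i, j$ and set $k = k + 1$. Then go to Step 2.

2. Set $m^{S,k}_{ij} = m^{S,k-1}_{ij} \, x_{+j} / m^{S,k-1}_{+j}$ for all $i, j$ and set $k = k + 1$. Then go to Step 3.

3. Set $m^{S,k}_{ii} = m^{S,k-1}_{ii} \, x_S / m^{S,k-1}_S$ for all $i = 1, \dots, \min(R, C)$ and
$m^{S,k}_{ij} = m^{S,k-1}_{ij} \, (n - x_S) / (n - m^{S,k-1}_S)$ for all $i \ne j$. Then set $k = k + 1$ and go to Step 1.

Here $m^{S,k}_{i+}$, $m^{S,k}_{+j}$ and $m^{S,k}_S$ denote the row sums, the column sums and the diagonal sum of $m^{S,k}$.
After convergence we set

$$\hat{m}^{S}_{ij} = m^{S,k}_{ij} \quad \text{for all } i, j.$$

We can initialize $m^{S,0}$ by

$$m^{S,0}_{ij} = n / (R \cdot C) \quad \text{for all } i, j.$$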
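
The three IPF steps can be sketched in a few lines. This is an illustrative implementation under our own assumptions, not the code used for the experiments: the function name `ipf_cdem`, the stopping rule (maximum cell change below `tol`) and the sweep limit are ours, one sweep performs Steps 1 to 3 in order, and all row sums and column sums are assumed positive with $0 < x_S < n$. The data are the Table 1 counts with blank cells read as zeros.

```python
def ipf_cdem(x, sweeps=5000, tol=1e-13):
    """Iterative proportional fitting for the common diagonal effect model."""
    R, C = len(x), len(x[0])
    d = min(R, C)
    n = sum(map(sum, x))
    row = [sum(x[i]) for i in range(R)]
    col = [sum(x[i][j] for i in range(R)) for j in range(C)]
    x_S = sum(x[i][i] for i in range(d))
    m = [[n / (R * C)] * C for _ in range(R)]  # uniform starting table
    for _ in range(sweeps):
        old = [r[:] for r in m]
        # Step 1: rescale each row to match the observed row sum.
        for i in range(R):
            s = sum(m[i])
            m[i] = [v * row[i] / s for v in m[i]]
        # Step 2: rescale each column to match the observed column sum.
        for j in range(C):
            s = sum(m[i][j] for i in range(R))
            for i in range(R):
                m[i][j] *= col[j] / s
        # Step 3: rescale diagonal and off-diagonal cells separately so that
        # the diagonal sum equals x_S while the total stays equal to n.
        m_S = sum(m[i][i] for i in range(d))
        for i in range(R):
            for j in range(C):
                m[i][j] *= x_S / m_S if i == j else (n - x_S) / (n - m_S)
        change = max(abs(m[i][j] - old[i][j]) for i in range(R) for j in range(C))
        if change < tol:
            break
    return m


table1 = [[22, 2, 2, 0],
          [5, 7, 14, 0],
          [0, 2, 36, 0],
          [0, 1, 17, 10]]
fit = ipf_cdem(table1)
for r in fit:
    print(["%.3f" % v for v in r])
```

At convergence the fitted table reproduces the observed row sums, column sums and diagonal sum, which is the likelihood equation for CDEM.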

As the discrepancy measure from the hypothesis of the common diagonal model, we
calculate (twice) the log likelihood ratio statistic

$$G^2 = 2 \sum_{i} \sum_{j} x_{ij} \log \frac{\hat{m}^{QI}_{ij}}{\hat{m}^{S}_{ij}}$$

for each sampled table $x = \{x_{ij}\}$.

In all experiments in this paper, we sampled 10,000 tables after 8,000 burn-in steps. 
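
The sampler itself is not spelled out in the text. A minimal sketch of one standard choice, a Metropolis chain in the style of Diaconis and Sturmfels, is given below under the following assumptions of ours: the conditional null distribution on the fiber is proportional to $1 / \prod_{i,j} x_{ij}!$, a move and its sign are proposed uniformly at random, and a proposal leaving the fiber is rejected. The function names and the small 2 x 3 starting table are hypothetical.

```python
import math
import random


def log_weight(x):
    """log of 1 / prod x_ij!, the unnormalised conditional probability."""
    return -sum(math.lgamma(v + 1) for r in x for v in r)


def metropolis_step(x, moves, rng):
    """Propose x + z or x - z for a random basis move z and accept or reject."""
    z = rng.choice(moves)
    sign = rng.choice((1, -1))
    y = [[v + sign * dz for v, dz in zip(rx, rz)] for rx, rz in zip(x, z)]
    if any(v < 0 for r in y for v in r):
        return x  # the move is not applicable, so the chain stays put
    log_ratio = log_weight(y) - log_weight(x)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return y
    return x


def margins(x):
    """Row sums, column sums and diagonal sum."""
    R, C = len(x), len(x[0])
    return ([sum(r) for r in x],
            [sum(x[i][j] for i in range(R)) for j in range(C)],
            sum(x[i][i] for i in range(min(R, C))))


# Hypothetical 2 x 3 example, where a single Type V move is available.
type_V = [[1, 1, -2],
          [-1, -1, 2]]
start = [[3, 1, 4],
         [2, 5, 1]]
rng = random.Random(0)
state = start
visited = set()
for _ in range(1000):
    state = metropolis_step(state, [type_V], rng)
    visited.add(tuple(map(tuple, state)))
print(margins(state) == margins(start), len(visited))
```

In an actual test one would evaluate $G^2$ at each retained state after burn-in and estimate the p-value by the fraction of sampled values at least as large as the observed one.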

Example 1. The first example is from Table 1 of Section 2. The value of $G^2$ for the
observed table in Table 1 is 13.5505 and the corresponding asymptotic p-value is 0.003585
from the asymptotic distribution $\chi^2_3$.

A histogram of sampled tables via MCMC with a Markov basis for Table 1 is in Figure
1. We estimated the p-value 0.00379 via MCMC with the Markov basis computed in this
paper. Therefore CDEM is rejected at the significance level of 5%.
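
The asymptotic p-value can be checked with the closed-form upper tail of the chi-square distribution for odd degrees of freedom. The sketch below assumes that the reference distribution has $\min(R, C) - 1 = 3$ degrees of freedom for a 4 x 4 table, i.e. the number of diagonal parameters removed by hypothesis (2); the function name `chi2_sf_odd` is ours.

```python
import math


def chi2_sf_odd(x, df):
    """Upper tail probability P(X > x) for chi-square with odd df.

    Uses the finite-sum formula
    erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r x^(r-1/2) / (1*3*...*(2r-1)),
    with r running from 1 to (df - 1) / 2.
    """
    if df < 1 or df % 2 == 0:
        raise ValueError("df must be a positive odd integer")
    tail = math.erfc(math.sqrt(x / 2.0))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)  # r = 1 term
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)  # turns the r-th term into the (r+1)-th
    return tail + total


p_asymptotic = chi2_sf_odd(13.5505, 3)
print(round(p_asymptotic, 6))  # should be close to the quoted 0.003585
```

The MCMC estimate 0.00379 is slightly larger than this asymptotic value, but both lead to rejection at the 5% level.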



Example 2. The second example is Table 2.12 from Agresti [2002]. Table 2 summarizes
responses of 91 married couples in Arizona about how often sex is fun. Columns represent
wives' responses and rows represent husbands' responses.

The value of $G^2$ for the observed table in Table 2 is 6.18159 and the corresponding
asymptotic p-value is 0.1031 from the asymptotic distribution $\chi^2_3$.
Figure 1: A histogram of sampled tables via MCMC with a Markov basis computed for
Table 1. The black line shows the asymptotic distribution $\chi^2_3$.

Table 2: Married couples in Arizona 
14



A histogram of sampled tables via MCMC with a Markov basis for Table 2 is in Figure
2. We estimated the p-value 0.12403 via MCMC with the Markov basis computed in this
paper. Therefore CDEM is not rejected at the significance level of 5%. We also see
that $\chi^2_3$ approximates the sampled distribution well for this observed data.



Example 3. The third example is Table 1 from Diaconis and Sturmfels [1998]. Table 3
shows data gathered to test the hypothesis of association between birth day and death day.
The table records the month of birth and death for 82 descendants of Queen Victoria. A
widely stated claim is that birthday-death day pairs are associated. Columns represent
the month of birth day and rows represent the month of death day. As discussed in
Diaconis and Sturmfels [1998], the Pearson's $\chi^2$ statistic for the usual independence model
is 115.6 with 121 degrees of freedom. Therefore the usual independence model is accepted
for this data. However, when CDEM is fitted, the Pearson's $\chi^2$ becomes 111.5 with 120
degrees of freedom. Therefore the fit of CDEM is better than the usual independence
model.
Figure 2: A histogram of sampled tables via MCMC with a Markov basis computed for
Table 2. The black line shows the asymptotic distribution $\chi^2_3$.



Table 3: Relationship between birthday and death day 
We now test CDEM against the quasi-independence model. The value of $G^2$ for the
observed table in Table 3 is 6.18839 and the corresponding asymptotic p-value is 0.860503
from the asymptotic distribution $\chi^2_{11}$.

A histogram of sampled tables via MCMC with a Markov basis for Table 3 is in Figure
3. We estimated the p-value 0.89454 via MCMC with the Markov basis computed in
this paper. There exists a large discrepancy between the asymptotic distribution and the
distribution estimated by MCMC due to the sparsity of the table.




Figure 3: A histogram of sampled tables via MCMC with a Markov basis computed for
Table 3. The black line shows the asymptotic distribution $\chi^2_{11}$.



6 Concluding remarks 

In this paper we derived an explicit form of a Markov basis for the diagonal sum problem.
With this Markov basis we showed that we can easily run the conditional test of the
common diagonal effect model. As seen from Figure 3 in Example 3, there may exist a
large discrepancy between the asymptotic distribution and the distribution estimated via
MCMC. This suggests the efficiency of the conditional test with a Markov basis especially
for a sparse table like Table 3.



In Hara et al. [2007b] we gave a necessary and sufficient condition on the subtable S
so that the set of square-free moves of degree two forms a Markov basis for S. For a
general S it seems to be difficult to explicitly describe a Markov basis. For the diagonal
S the Markov basis in Theorem 1 turned out to be relatively simple. It would be helpful
to consider some other special type of S in order to understand Markov bases for totally
general S.

We have stated Theorem 1 for the case that S contains all the diagonal elements $(i, i)$,
$i = 1, \dots, \min(R, C)$. Actually our proof shows that our result can be generalized to S
which is a subset of the diagonal cells. Furthermore we can relabel the rows and the
columns. Therefore the essential condition for the result in this paper is that S contains
at most one cell in each row and each column of the R x C table.

Theorem 1 was stated for the case $\min(R, C) \ge 3$ and $\max(R, C) \ge 4$. For smaller
tables, we just omit the moves that do not fit into the table. For completeness we list
these cases and give a Markov basis for each case. To avoid triviality, we assume
$\min(R, C) \ge 2$.

1. 2 x 2: CDEM is the same as the saturated model and no degrees of freedom are left
for the moves.

2. 2 x 3: Type V moves form a Markov basis.

3. 2 x C, C ≥ 4: Moves of Type I, V and VI form a Markov basis.

4. 3 x 3: Moves of Type II, III and V form a Markov basis.

It may be interesting and important to extend the subtable sum and/or diagonal sum
problems to higher dimensional tables. However this seems to be difficult at this point
and is left for our future studies.